[AURON #2366] fix: Handle Paimon metadata columns in V2 native scan #2367

Merged: SteNicholas merged 5 commits into apache:master from lyne7-sc:fix/paimon_meta on Jun 30, 2026
@lyne7-sc

Contributor

Which issue does this PR close?

Closes #2366

Rationale for this change

Paimon metadata columns are produced by the Paimon scan layer rather than stored as physical columns in data files. The Paimon V2 native scan was passing these columns to the native Parquet/ORC reader as file columns, which can return incorrect values.

For example:

create table paimon.db.t_metadata (id int, v string) using paimon;
insert into paimon.db.t_metadata values (1, 'a');
select id, __paimon_file_path from paimon.db.t_metadata;

The native path returned null for __paimon_file_path, while Spark/Paimon's scan path returns the actual file path.

What changes are included in this PR?

  • Recognize Paimon metadata columns using PaimonMetadataColumn.
  • Materialize supported file-level metadata columns (__paimon_file_path, __paimon_bucket) as per-file constants.
  • Keep unsupported Paimon metadata columns on Spark/Paimon's scan path instead of reading them from Parquet/ORC files.
  • Cover metadata columns both with and without table partition columns.
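As a rough illustration of the second and third bullets, the following is a minimal sketch and not the code from this PR. It assumes a hypothetical `DataFile` holder and a hypothetical `constantsFor` helper: supported file-level metadata columns become one constant per data file, and any other requested metadata column yields an empty result, which a caller could treat as the signal to stay on Spark/Paimon's scan path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class MetadataConstantsSketch {
    static final String FILE_PATH = "__paimon_file_path";
    static final String BUCKET = "__paimon_bucket";

    /** Hypothetical stand-in for one data file inside a split. */
    static final class DataFile {
        final String path;
        final int bucket;

        DataFile(String path, int bucket) {
            this.path = path;
            this.bucket = bucket;
        }
    }

    /**
     * Builds the per-file constant values for the requested metadata columns.
     * Returns Optional.empty() as soon as an unsupported metadata column is
     * requested, so the caller can fall back instead of reading it from the file.
     */
    static Optional<List<Object>> constantsFor(DataFile file, List<String> metadataColumns) {
        List<Object> values = new ArrayList<>();
        for (String name : metadataColumns) {
            if (FILE_PATH.equals(name)) {
                values.add(file.path);    // same value for every row of this file
            } else if (BUCKET.equals(name)) {
                values.add(file.bucket);  // same value for every row of this file
            } else {
                return Optional.empty();  // unsupported: do not guess a value
            }
        }
        return Optional.of(values);
    }

    public static void main(String[] args) {
        DataFile file = new DataFile("/warehouse/db.db/t/bucket-3/data-0.parquet", 3);
        System.out.println(constantsFor(file, Arrays.asList(FILE_PATH, BUCKET)));
        // "__paimon_hypothetical" is a made-up name used only to trigger the fallback.
        System.out.println(constantsFor(file, Arrays.asList("__paimon_hypothetical")));
    }
}
```

The actual change routes these constants through the partition schema; the sketch only shows the supported-versus-fallback decision.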

Are there any user-facing changes?

No API changes. This is a correctness fix for Paimon V2 native scan.

How was this patch tested?

Adds Paimon V2 integration tests.

Copilot AI left a comment

Contributor

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@SteNicholas left a comment

Member

@lyne7-sc, thanks for the fix! The overall approach is sound: materialize __paimon_file_path/__paimon_bucket as per-file constants via partitionSchema, and fall back to Spark for unsupported metadata columns. The functional Test Paimon 1.2 CI job (which runs the new integration tests) is green.

@SteNicholas self-assigned this on Jun 28, 2026
@lyne7-sc

Contributor Author

@SteNicholas Thanks for the careful review! I addressed the comments in the latest update, and the relevant CI is green now.

@SteNicholas

Member

@lyne7-sc, could you share your WeChat so that I can discuss this with you?

@SteNicholas left a comment

Member

@lyne7-sc, thanks for the updates. The approach of reusing the partition-constant mechanism, rather than reading these columns from Parquet/ORC, is clean. I found two correctness issues (verified against the decompiled Paimon 1.2.0 sources), plus a couple of value-fidelity edge cases and test-coverage gaps; see the inline comments below. I recommend addressing the two confirmed bugs (the __paimon_file_path encoding and the partition-key/metadata name collision) before merge.

Minor (not blocking): in toPartitionValueTemplate, SQLConf.get.resolver and indexByName = partitionKeys.zipWithIndex.toMap are split-invariant but rebuilt per split; and partitionKeys() is fetched twice (a Set at L131 and a Seq at L175). Worth hoisting into computePlan. Also consider whether file_path could be materialized on the executor (as NativeIcebergTableScanExec.metadataPartitionValues does) instead of baked into a per-file InternalRow on the driver.

@apache apache deleted a comment from lyne7-sc Jun 30, 2026

@SteNicholas left a comment

Member

LGTM. The native Paimon V2 scan correctly handles __paimon_file_path and __paimon_bucket, falls back for unsupported metadata columns, and properly distinguishes physical/partition columns that collide with metadata names. The test coverage (multi-file splits, non-zero buckets, name collisions, partitioned tables, special-character partition values) is thorough.
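To make the name-collision point concrete, here is a minimal sketch of the kind of guard involved. It is not the code from PaimonScanSupport.scala: the class, the `RESOLVER` predicate (a case-insensitive stand-in for Spark's name resolver), and the `physicalColumns` parameter are all assumptions for illustration. The idea is that a name is treated as the file-path metadata column only when it matches the metadata name and no physical column of the table carries that same name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

final class MetadataNameCollisionSketch {
    static final String FILE_PATH = "__paimon_file_path";

    // Assumption: case-insensitive matching, standing in for Spark's resolver.
    static final BiPredicate<String, String> RESOLVER = String::equalsIgnoreCase;

    /** True if any entry of names matches name under the resolver. */
    static boolean containsName(List<String> names, String name) {
        return names.stream().anyMatch(n -> RESOLVER.test(n, name));
    }

    /**
     * A column counts as the file-path metadata column only if the name matches
     * and the table has no physical column with the same name.
     */
    static boolean isFilePathMetadataColumn(String name, List<String> physicalColumns) {
        return RESOLVER.test(name, FILE_PATH) && !containsName(physicalColumns, name);
    }

    public static void main(String[] args) {
        // No collision: the name refers to the metadata column.
        System.out.println(isFilePathMetadataColumn(FILE_PATH, Arrays.asList("id", "v")));
        // Collision: the table stores a physical column with the metadata name.
        System.out.println(isFilePathMetadataColumn(FILE_PATH, Arrays.asList("id", FILE_PATH)));
    }
}
```

With a guard like this, a user column that happens to be named __paimon_file_path is read from the data file as usual instead of being overwritten by a per-file constant.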

One thing I verified closely: the __paimon_file_path value is built with new Path(rawFilePath).toUri.toString. This matches Paimon 1.2.0, whose PaimonRecordReaderIterator materializes the column via filePath().toUri().toString() (the percent-escaped form), as confirmed against the paimon-spark-3.x:1.2.0 artifact. So results agree with vanilla Paimon, including the '50%' partition case. (Note for the future: newer Paimon versions switched this to the unescaped Path.toString() form, so if Auron ever bumps the Paimon dependency, this rendering will need to follow.)
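The difference between the two renderings can be seen with plain `java.net.URI`, used here as a stand-in under the assumption that Hadoop's `Path` builds its URI with the quoting multi-argument constructors. This is a sketch of the escaping behavior only, not of Paimon's or Auron's code, and the path below is a made-up example.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class FilePathRenderingSketch {
    /** Builds a URI from a raw path; the multi-argument constructors quote '%'. */
    static URI toUri(String rawPath) {
        try {
            return new URI(null, null, rawPath, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Percent-escaped form, analogous to toUri().toString(). */
    static String escaped(String rawPath) {
        return toUri(rawPath).toString();
    }

    /** Decoded form, analogous to rendering the path without escaping. */
    static String unescaped(String rawPath) {
        return toUri(rawPath).getPath();
    }

    public static void main(String[] args) {
        String raw = "/warehouse/db.db/t/pt=50%/bucket-0/data-0.parquet";
        System.out.println(escaped(raw));   // /warehouse/db.db/t/pt=50%25/bucket-0/data-0.parquet
        System.out.println(unescaped(raw)); // /warehouse/db.db/t/pt=50%/bucket-0/data-0.parquet
    }
}
```

A partition value such as '50%' is therefore the kind of input where the two forms diverge, which is why the rendering has to track whichever form the Paimon version in use produces.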

A few optional, non-blocking nits:

  • PaimonScanSupport.scala — in toPartitionValueTemplate, the else if (isFilePathMetadataColumn(field.name)) null branch returns the same null as the trailing else (dead) and, unlike isPartitionValueField / filePathMetadataIndex, omits the !isPhysicalColumn guard. Safe today only because a physical column reaches partitionSchema solely as a real partition key; consider dropping the branch or adding the guard for consistency.
  • metadataFilePath = new Path(...).toUri.toString is computed for every data file even when no __paimon_file_path column is projected (filePathMetadataIndex < 0) and then discarded — could be guarded behind filePathMetadataIndex >= 0.
  • containsName / isFilePathMetadataColumn / isBucketMetadataColumn re-fetch SQLConf.get.resolver per call inside the per-field/per-file loop, though computePlan already binds resolver.
  • In isPaimonMetadataColumn, the containsName(PaimonMetadataColumns, name) clause is subsumed by the startsWith("__paimon_") prefix check, so the PaimonMetadataColumns set is redundant.
  • dataFile.externalPath().orElse(s"...") eagerly builds the fallback string even when externalPath() is present; orElseGet(...) would defer it.

None of these block the change.
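For the last nit, the eager-versus-deferred difference between `Optional.orElse` and `Optional.orElseGet` can be shown in isolation. The sketch below is illustrative only: `buildFallback`, the counter, and the paths are hypothetical and merely make the evaluation visible.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

final class LazyFallbackSketch {
    // Counts how often the fallback string is actually built.
    static final AtomicInteger fallbackBuilds = new AtomicInteger();

    /** Hypothetical fallback: joins a directory and a file name. */
    static String buildFallback(String dir, String name) {
        fallbackBuilds.incrementAndGet();
        return dir + "/" + name;
    }

    /** orElse evaluates its argument before the call, even if a value is present. */
    static String eager(Optional<String> externalPath, String dir, String name) {
        return externalPath.orElse(buildFallback(dir, name));
    }

    /** orElseGet invokes the supplier only when the Optional is empty. */
    static String lazy(Optional<String> externalPath, String dir, String name) {
        return externalPath.orElseGet(() -> buildFallback(dir, name));
    }

    public static void main(String[] args) {
        Optional<String> present = Optional.of("/external/data-0.parquet");
        eager(present, "/warehouse/t/bucket-0", "data-0.parquet");
        System.out.println(fallbackBuilds.get()); // 1: fallback built and discarded
        lazy(present, "/warehouse/t/bucket-0", "data-0.parquet");
        System.out.println(fallbackBuilds.get()); // still 1: supplier never ran
    }
}
```

The saving per file is small, but the call sits in a per-file loop, so deferring the string construction is the cheaper default.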

@SteNicholas merged commit 8145cc9 into apache:master on Jun 30, 2026
123 checks passed
Development

Successfully merging this pull request may close these issues.

Paimon V2 native scan does not handle metadata columns correctly

3 participants